
    The added value of text from Dutch general practitioner notes in predictive modeling

    Objective: This work aims to explore the value of Dutch unstructured data, in combination with structured data, for the development of prognostic prediction models in a general practitioner (GP) setting. Materials and methods: We trained and validated prediction models for 4 common clinical prediction problems using various sparse text representations, common prediction algorithms, and observational GP electronic health record (EHR) data. We trained and validated 84 models internally and externally on data from different EHR systems. Results: On average, over all the different text representations and prediction algorithms, models using only text data performed better than or similarly to models using structured data alone in 2 prediction tasks. Additionally, in these 2 tasks, the combination of structured and text data outperformed models using structured or text data alone. No large performance differences were found between the different text representations and prediction algorithms. Discussion: Our findings indicate that the use of unstructured data alone can result in well-performing prediction models for some clinical prediction problems. Furthermore, the performance improvement achieved by combining structured and text data highlights the added value of text. Additionally, we demonstrate the significance of clinical natural language processing research in languages other than English and the possibility of validating text-based prediction models across various EHR systems. Conclusion: Our study highlights the potential benefits of incorporating unstructured data in clinical prediction models in a GP setting. Although the added value of unstructured data may vary depending on the specific prediction task, our findings suggest that it has the potential to enhance patient care.
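
    As a rough illustration of the kind of pipeline described here, the sketch below feeds a sparse text representation of GP notes and structured EHR fields into a single model. The file name, column names, and the choice of TF-IDF with logistic regression are illustrative assumptions, not the authors' actual setup.

```python
# Sketch: combining sparse GP-note text with structured EHR features.
# File and column names ("notes", "age", "sex", "outcome") are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("gp_ehr.csv")  # hypothetical extract of one EHR system

features = ColumnTransformer([
    ("text", TfidfVectorizer(max_features=20_000), "notes"),   # sparse text representation
    ("num", StandardScaler(), ["age"]),                        # structured numeric data
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),  # structured categorical data
])

model = Pipeline([("features", features), ("clf", LogisticRegression(max_iter=1000))])
model.fit(df.drop(columns="outcome"), df["outcome"])
```

    External validation on a second EHR system would then amount to scoring data mapped to the same columns with the fitted pipeline.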

    Rewriting and suppressing UMLS terms for improved biomedical term identification

    Background: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evaluated nine term rewrite and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact of the different rules on the number of terms identified in a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. The 50 most frequently found terms together with a sample of 100 randomly selected terms were evaluated for every rule. Results: Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms, and seven of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we were able to identify 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without the rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, an increase of 2.8% in the number of terms and of 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size; 7,397 terms were suppressed in the corpus. Conclusions: We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is to be used for biomedical term identification in MEDLINE. A software tool to apply these rules to the UMLS is freely available at http://biosemantics.org/casper.
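
    The rewrite/suppress mechanism can be sketched on plain term strings as below. The two rules shown (stripping a trailing parenthetical qualifier; suppressing very short terms and 'NOS' terms) are plausible stand-ins, not necessarily among the paper's nine rewrite and eight suppression rules.

```python
# Sketch of term rewriting and suppression on UMLS-style term strings.
# Both example rules are illustrative assumptions.
import re

def rewrite(term: str) -> list[str]:
    """Return variants generated from one term, e.g. qualifier stripped."""
    variants = []
    stripped = re.sub(r"\s*\([^)]*\)$", "", term)  # "Cold (disease)" -> "Cold"
    if stripped != term:
        variants.append(stripped)
    return variants

def suppress(term: str) -> bool:
    """True if the term is undesirable for term identification."""
    return len(term) < 3 or term.upper().endswith(" NOS")

terms = ["Cold (disease)", "Fever NOS", "myocardial infarction", "at"]
kept = [t for t in terms if not suppress(t)]
expanded = kept + [v for t in kept for v in rewrite(t)]
print(expanded)  # ['Cold (disease)', 'myocardial infarction', 'Cold']
```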

    Thesaurus-based disambiguation of gene symbols

    BACKGROUND: Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck. RESULTS: We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set. CONCLUSION: The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not-a-gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications.
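
    A minimal sketch of the thesaurus-based idea, under the assumption that each candidate gene carries a bag of context terms drawn from its database entries: the gene whose context overlaps most with the abstract wins, and a zero score suggests a non-gene meaning. The toy thesaurus below is invented.

```python
# Toy thesaurus-based disambiguation: pick the gene sense whose context
# terms (from gene databases/MeSH) overlap most with the abstract text.
# The thesaurus content is invented for illustration.
THESAURUS = {
    "PSA": {
        "KLK3":  {"prostate", "antigen", "kallikrein", "serum"},
        "PSAT1": {"phosphoserine", "aminotransferase", "biosynthesis"},
    }
}

def disambiguate(symbol: str, abstract: str) -> str | None:
    words = set(abstract.lower().split())
    scores = {gene: len(ctx & words) for gene, ctx in THESAURUS[symbol].items()}
    best_gene, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_gene if best_score > 0 else None  # None: possibly not a gene

print(disambiguate("PSA", "serum prostate specific antigen levels were measured"))
# -> KLK3
```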

    Electrocardiographic Criteria for Left Ventricular Hypertrophy in Children

    Previous studies to determine the sensitivity of the electrocardiogram (ECG) for left ventricular hypertrophy (LVH) in children had their imperfections: they were not done on an unselected hospital population, several criteria used in adults were not applied to children, and obsolete limits of normal for the ECG parameters were used. Furthermore, left ventricular mass (LVM) was taken as the reference standard for LVH, with no regard for other clinical evidence. The study population consisted of 832 children from whom a 12-lead ECG and an M-mode echocardiogram were taken on the same day. The validity of the ECG criteria was judged on the basis of an abnormal LVM index, either alone or in combination with other clinical evidence. The ECG criteria were based on recently established age-dependent normal limits. At 95% specificity, the ECG criteria had low sensitivities (<25%) when an elevated LVM index was taken as the reference for LVH. When clinical evidence was also taken into account, the sensitivity improved considerably, although it remained below 43%. Sensitivities could be further improved when ECG parameters were combined. The sensitivity of the pediatric ECG in detecting LVH is low but depends strongly on the definition of the reference used for validation.
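
    The evaluation setup, sensitivity of a criterion at a fixed 95% specificity, can be sketched as below on simulated data; the parameter distributions are invented and stand in for an ECG measurement with age-dependent normal limits.

```python
# Sketch: sensitivity of an ECG parameter at a fixed 95% specificity,
# the evaluation setup described in the abstract. Data are simulated.
import numpy as np

rng = np.random.default_rng(0)
normal = rng.normal(1.0, 0.3, 500)   # ECG parameter in children without LVH
lvh = rng.normal(1.4, 0.3, 100)      # same parameter in children with LVH

threshold = np.quantile(normal, 0.95)    # cut-off keeping specificity at 95%
sensitivity = np.mean(lvh > threshold)   # fraction of LVH cases detected
print(f"threshold={threshold:.2f}, sensitivity={sensitivity:.2f}")
```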

    Measurement of coronary calcium scores or exercise testing as initial screening tool in asymptomatic subjects with ST-T changes on the resting ECG: an evaluation study

    Background: Asymptomatic subjects at intermediate coronary risk may need diagnostic testing for risk stratification. Both measurement of coronary calcium scores and exercise testing are well-established tests for this purpose. However, it is not clear which test should be preferred as the initial diagnostic test. We evaluated the prevalence of documented coronary artery disease (CAD) according to calcium scores and exercise test results. Methods: Asymptomatic subjects with ST-T changes on a rest ECG were selected from the population-based PREVEND cohort study and underwent measurement of calcium scores by electron beam tomography and exercise testing. With calcium scores ≥10 or a positive exercise test, myocardial perfusion imaging (MPS) or coronary angiography (CAG) was recommended. The primary endpoint was documented obstructive CAD (≥50% stenosis). Results: Of 153 subjects included, 149 completed the study protocol. Calcium scores ≥400, 100-399, 10-99 and <10 were found in 16, 29, 18 and 86 subjects, and the primary endpoint was present in 11 (69%), 12 (41%), 0 (0%) and 1 (1%) of these subjects, respectively. A positive, nondiagnostic and negative exercise test was present in 33, 27 and 89 subjects, and the primary endpoint was present in 13 (39%), 5 (19%) and 6 (7%) of these subjects, respectively. Receiver operating characteristic analysis showed that the area under the curve, as a measure of diagnostic yield, was 0.91 (95% CI 0.84-0.97) for calcium scores, superior to the 0.74 (95% CI 0.64-0.83) for exercise testing (p = 0.004). Conclusion: Measurement of coronary calcium scores had a higher diagnostic yield than exercise testing and may therefore be preferred as the initial screening test in asymptomatic subjects at intermediate coronary risk.
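
    The AUC comparison reported above can be reproduced in outline as follows; the simulated scores are stand-ins for the actual calcium scores and exercise test results, and a DeLong-style test for the p-value is omitted from the sketch.

```python
# Sketch: comparing the diagnostic yield of two tests by ROC AUC, as in
# the abstract (calcium score AUC 0.91 vs exercise test 0.74). Simulated.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
disease = rng.integers(0, 2, 149)                                 # documented CAD yes/no
calcium = disease * rng.normal(2.0, 1.0, 149) + rng.normal(0, 1, 149)
exercise = disease * rng.normal(1.0, 1.0, 149) + rng.normal(0, 1, 149)

print("calcium AUC:", round(roc_auc_score(disease, calcium), 2))
print("exercise AUC:", round(roc_auc_score(disease, exercise), 2))
```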

    Multi-component based cross correlation beat detection in electrocardiogram analysis

    BACKGROUND: The first stage in computerised processing of the electrocardiogram is beat detection. This involves identifying all cardiac cycles and locating the position of the beginning and end of each of the identifiable waveform components. The accuracy at which beat detection is performed has a significant impact on the overall classification performance, hence efforts are still being made to improve this process. METHODS: A new beat detection approach is proposed based on the fundamentals of cross correlation and compared with two benchmarking approaches of non-syntactic and cross correlation beat detection. The new approach can be considered a multi-component based variant of traditional cross correlation, where each of the individual inter-wave components is sought in isolation as opposed to being sought in one complete process. Each of the three techniques was compared based on its performance in detecting the P wave, QRS complex and T wave, in addition to onset and offset markers, for 3000 cardiac cycles. RESULTS: Results indicated that the multi-component based cross correlation approach exceeded the performance of the two benchmarking techniques by firstly correctly detecting more cardiac cycles and secondly providing the most accurate marker insertion in 7 out of the 8 categories tested. CONCLUSION: The main benefit of the multi-component based cross correlation algorithm is firstly its ability to successfully detect cardiac cycles and secondly the accurate insertion of the beat markers based on pre-defined values, as opposed to performing individual gradient searches for wave onsets and offsets following fiducial point location.
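
    A minimal sketch of the multi-component idea: instead of matching one full-beat template, each wave template (P, QRS, T) is cross-correlated with the signal separately. The simulated signal and template shapes are assumptions.

```python
# Sketch of multi-component cross correlation: each wave template is
# matched against the signal in isolation rather than as one full beat.
import numpy as np

def locate(signal: np.ndarray, template: np.ndarray) -> int:
    """Index where the cross correlation with the template peaks."""
    corr = np.correlate(signal - signal.mean(), template - template.mean(), "valid")
    return int(np.argmax(corr))

t = np.linspace(0, 1, 500)
ecg = np.exp(-((t - 0.3) ** 2) / 0.0002)                # crude QRS-like spike
qrs_template = np.exp(-(np.linspace(-0.02, 0.02, 21) ** 2) / 0.0002)

qrs_pos = locate(ecg, qrs_template)  # repeat with P and T templates
print("QRS template best match at sample", qrs_pos)
```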

    Mining for diagnostic information in body surface potential maps: A comparison of feature selection techniques

    BACKGROUND: In body surface potential mapping, increased spatial sampling is used to allow more accurate detection of a cardiac abnormality. Although diagnostically superior to more conventional electrocardiographic techniques, the perceived complexity of the Body Surface Potential Map (BSPM) acquisition process has prohibited its acceptance in clinical practice. For this reason there is an interest in striking a compromise between the minimum number of electrocardiographic recording sites and the maximum electrocardiographic information sampled. METHODS: In the current study, several techniques widely used in the domains of data mining and knowledge discovery have been employed to mine for diagnostic information in 192-lead BSPMs. In particular, the Single Variable Classifier (SVC) based filter and Sequential Forward Selection (SFS) based wrapper approaches to feature selection have been implemented and evaluated. Using a set of recordings from 116 subjects, the diagnostic ability of subsets of 3, 6, 9, 12, 24 and 32 electrocardiographic recording sites has been evaluated based on their ability to correctly assess the presence or absence of Myocardial Infarction (MI). RESULTS: It was observed that the wrapper approach, using sequential forward selection and a 5 nearest neighbour classifier, was capable of choosing a set of 24 recording sites that could correctly classify 82.8% of BSPMs. Although the filter method performed slightly less favourably, its performance was comparable, with a classification accuracy of 79.3%. In addition, experiments were conducted to show how (a) features chosen using the wrapper approach were specific to the classifier used in the selection model, and (b) lead subsets chosen were not necessarily unique. CONCLUSION: It was concluded that both the filter and wrapper approaches adopted were suitable for guiding the choice of recording sites useful for determining the presence of MI. It should be noted, however, that in this study recording sites have been suggested on the basis of their ability to detect disease, and such sites may not be optimal for estimating body surface potential distributions.
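
    The wrapper approach described above, sequential forward selection around a 5-nearest-neighbour classifier, might look like the following sketch; the random data are stand-ins for the 116-subject, 192-lead BSPM set.

```python
# Sketch of the SFS wrapper: greedily grow a lead subset, scoring each
# candidate set by cross-validated 5-NN accuracy. Data are simulated.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(116, 192))   # 116 subjects x 192 BSPM leads
y = rng.integers(0, 2, 116)       # MI present/absent

knn = KNeighborsClassifier(n_neighbors=5)
sfs = SequentialFeatureSelector(knn, n_features_to_select=24, direction="forward")
sfs.fit(X, y)
print("chosen leads:", np.flatnonzero(sfs.get_support()))
```

    The SVC filter, by contrast, would rank each lead on its own classification accuracy and keep the top-scoring leads, without wrapping the final classifier in the selection loop.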

    Prevalence of macrovascular disease amongst type 2 diabetic patients detected by targeted screening and patients newly diagnosed in general practice: the Hoorn Screening Study

    OBJECTIVES: Screening for type 2 diabetes has been recommended, and targeted screening might be an efficient way to screen. The aim was to investigate whether diabetic patients identified by a targeted screening procedure differ from newly diagnosed diabetic patients in general practice with regard to the prevalence of macrovascular complications. DESIGN: Cross-sectional population-based study. SETTING: Population study, primary care. SUBJECTS: Diabetic patients identified by a population-based targeted screening procedure (SDM patients), consisting of a screening questionnaire and a fasting capillary glucose measurement followed by diagnostic testing, were compared with newly diagnosed diabetic patients in general practice (GPDM patients). Ischaemic heart disease and prior myocardial infarction (MI) were assessed by ECG recording. Peripheral arterial disease was assessed by the ankle-arm index. Intima-media thickness of the right common carotid artery was measured with ultrasound. RESULTS: A total of 195 SDM patients and 60 GPDM patients participated in the medical examination. The prevalence of MI was 13.3% (95% CI 9.3-18.8%) and 3.4% (1.0-11.7%) in SDM and GPDM patients, respectively. The prevalence of ischaemic heart disease was 39.5% (95% CI 32.9-46.5%) in SDM patients and 24.1% (15.0-36.5%) in GPDM patients. The prevalence of peripheral arterial disease was similar in both groups: 10.6% (95% CI 6.9-15.9%) and 10.2% (4.7-20.5%), respectively. Mean intima-media thickness was 0.85 mm (±0.17) in SDM patients and 0.90 mm (±0.20) in GPDM patients; the difference was not statistically significant. CONCLUSIONS: Targeted screening identified patients with a prevalence of macrovascular complications similar to that of patients detected in general practice, but with a lower degree of hyperglycaemia.
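
    The MI prevalence interval quoted above matches a Wilson score interval for 26 of 195 patients; whether the authors used exactly this method is an assumption. A sketch:

```python
# Sketch: 95% confidence interval for a prevalence, the kind of figure
# reported in the abstract (Wilson score interval; method assumed).
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = k / n
    centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return centre - half, centre + half

lo, hi = wilson_ci(26, 195)  # 26/195 = 13.3% MI prevalence in SDM patients
print(f"{26/195:.1%} (95% CI {lo:.1%}-{hi:.1%})")  # -> 13.3% (9.3%-18.8%)
```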

    Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation

    BACKGROUND: High-throughput experiments, such as with DNA microarrays, typically result in hundreds of genes potentially relevant to the process under study, rendering the interpretation of these experiments problematic. Here, we propose and evaluate an approach to find functional associations between large numbers of genes and other biomedical concepts from free-text literature. For each gene, a profile of related concepts is constructed that summarizes the context in which the gene is mentioned in literature. We assign a weight to each concept in the profile based on a likelihood ratio measure. Gene concept profiles can then be clustered to find related genes and other concepts. RESULTS: The experimental validation was done in two steps. We first applied our method on a controlled test set. After this proved successful, the datasets from two DNA microarray experiments were analyzed in the same way and the results were evaluated by domain experts. The first dataset was a gene-expression profile that characterizes the cancer cells of a group of acute myeloid leukemia patients. For this group of patients the biological background of the cancer cells is largely unknown. Using our methodology we found an association of these cells to monocytes, which agreed with other experimental evidence. The second dataset consisted of differentially expressed genes following androgen receptor stimulation in a prostate cancer cell line. Based on the analysis we put forward a hypothesis about the biological processes induced in these studied cells: secretory lysosomes are involved in the production of prostatic fluid, and their development and/or secretion are androgen-regulated processes. CONCLUSION: Our method can be used to analyze DNA microarray datasets based on information explicitly and implicitly available in the literature. We provide a publicly available tool, dubbed Anni, for this purpose.
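
    A toy version of the concept-profile construction: concepts are weighted by a likelihood-ratio-style measure (frequency in the gene's documents relative to the background corpus), and profiles are compared by cosine similarity. All counts below are invented, and the paper's exact weighting may differ.

```python
# Sketch: text-derived concept profiles with likelihood-ratio-style
# weights, compared by cosine similarity. Counts are invented.
import math

def profile(concept_counts: dict[str, int], docs_with_gene: int,
            bg_counts: dict[str, int], total_docs: int) -> dict[str, float]:
    """Weight each concept by log(freq in gene's docs / background freq)."""
    return {
        c: math.log((k / docs_with_gene) / (bg_counts[c] / total_docs))
        for c, k in concept_counts.items()
    }

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

bg = {"monocyte": 900, "apoptosis": 5000}
gene1 = profile({"monocyte": 12, "apoptosis": 3}, 40, bg, 100_000)
gene2 = profile({"monocyte": 9, "apoptosis": 5}, 35, bg, 100_000)
print(f"profile similarity: {cosine(gene1, gene2):.2f}")
```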

    Dataset of manually measured QT intervals in the electrocardiogram

    BACKGROUND: The QT interval and the QT dispersion are currently a subject of considerable interest. Cardiac repolarization delay is known to favor the development of arrhythmias. The QT dispersion, defined as the difference between the longest and the shortest QT intervals or as the standard deviation of the QT duration in the 12-lead ECG, is assumed to be a reliable predictor of cardiovascular mortality. The seventh annual PhysioNet/Computers in Cardiology Challenge (2006) addresses a question of high clinical interest: can the QT interval be measured by fully automated methods with accuracy acceptable for clinical evaluations? METHOD: The PTB Diagnostic ECG Database was given to 4 cardiologists and 1 biomedical engineer for manual marking of QRS onsets and T-wave ends in 548 recordings. Each recording consisted of one selected beat in lead II, chosen visually to have minimum baseline shift, noise, and artifact. In cases where no T wave could be observed or its amplitude was very small, the referees were instructed to mark a 'group-T-wave end', taking into consideration leads with a better manifested T wave. A modified Delphi approach was used, which included up to three rounds of measurements to obtain results closer to the median. RESULTS: A total of 2 × 5 × 548 = 5,480 marks (a Q-onset and a T-wave end per recording from each of the 5 referees) were made during round 1. To bring results closer to the median, 8.58% of Q-onsets and 3.21% of T-wave ends had to be reviewed during round 2, and 1.50% of Q-onsets and 1.17% of T-wave ends in round 3. The mean and standard deviation of the differences between the values of the referees and the median after round 3 were 2.43 ± 0.96 ms for the Q-onset and 7.43 ± 3.44 ms for the T-wave end. CONCLUSION: A fully accessible dataset of manually measured Q-onsets and T-wave ends was created and is presented in Additional file 1 (Table 4) of this article. This available standard can be used for the development of automated methods for the detection of Q-onsets, T-wave ends, and QT interval measurements.
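
    The modified Delphi rounds can be sketched as follows: marks that deviate from the referees' median by more than some tolerance are returned for review in the next round. The 10 ms tolerance and the example marks are assumptions.

```python
# Sketch of the modified Delphi rounds: flag referee marks far from the
# per-recording median for re-review. Tolerance and data are assumed.
import numpy as np

marks_ms = np.array([  # T-wave end marked by 5 referees in 3 recordings
    [362, 365, 360, 371, 363],
    [410, 398, 405, 404, 441],
    [288, 290, 286, 289, 291],
])

median = np.median(marks_ms, axis=1, keepdims=True)
needs_review = np.abs(marks_ms - median) > 10  # boolean per referee mark
print("marks to re-review in round 2:\n", needs_review)
```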